A/B Testing with t-Tests

Organizations with a data-driven analytical focus often run hundreds of experiments each year as part of deploying analytic models or other changes to see whether the model or change had the desired effect. Experiments are designed to confirm or reject a hypothesis. Recall the process from middle school science class: (1) formulate a hypothesis, (2) design an experiment, (3) collect the data, and (4) draw conclusions. Rather than teaching statistical analysis in depth, this section will hit the highlights of this process, especially as it applies to data analytics.1

A/B Testing

The most common experimental design in data analytics is the A/B test. It is an experiment with two groups whose goal is to establish which of two products, web pages, treatments, etc. is superior; for example, testing two web page ads to determine which generates more conversions, or testing two headlines to determine which produces more clicks. Usually, one of the two treatments is the standard existing treatment, called the control. The other is the new treatment. Subjects (the web visitors, patients, etc. exposed to the treatments) should be randomly assigned to the two treatments.

Why is the control group necessary? Why not just apply the treatment to one group and compare it to past historical data? The reason is that the control group is subject to the same conditions as the treatment group, other than the treatment. In a comparison to past historical data, other factors might differ besides the treatment. Using a control group ensures that all other things are equal, and any difference between the treatment group and the control group is due to either the direct effect of the different treatments or chance.

Significance Test (Hypothesis Tests)

Significance tests help determine whether random chance might be responsible for the observed difference between groups A and B. The assumption is that the treatments are equivalent, and any difference is due to chance (called the null hypothesis H0).

The alternative hypothesis HA is that the outcomes for groups A and B are more different than what random chance might produce. Another way it is often taught is that the null hypothesis says there is no difference between the samples, and the alternative hypothesis says there is a difference. For example, H0 might say that the new web page B has the same influence on some outcome (sales) as the current page A, while HA might say that the new web page B has a greater influence on that outcome than the current page A. We can express these two hypotheses as:

  • H0: The two means are equal (any difference is purely random)

  • HA: The two means are not equal (the difference is too great to be purely random)

Statistical significance measures whether a result is more extreme than what random chance alone might produce. In other words, when conducting an experiment with two samples from the same population, there will always be some difference between the means. The purpose of statistical analysis is to determine whether this difference is purely due to random chance or is greater than what random chance could plausibly generate. A result that would be extremely unlikely under random chance alone is labeled statistically significant.

What is the measure of “extremely unlikely”? We usually set a threshold of unlikeliness that chance results must surpass, known as the alpha level, at 5% or 1%. This is also called the significance level and is the probability of rejecting the null hypothesis when it is actually true.

Given a chance model (the null hypothesis model), what is the probability of a result this extreme? This probability is known as the p-value. In data analytics, the p-value is useful for knowing whether a model result is within the range of normal chance variability. A p-value larger than the alpha threshold means that the effect is not proven; it could be due to chance. A p-value smaller than the alpha threshold means that the effect is probably not due to chance, so it might be real (but is still not proven). Proving causation requires more than a statistically significant p-value.
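To make the chance model concrete, here is a minimal Python sketch (the conversion counts are hypothetical, and numpy is assumed available) that estimates a p-value by simulation: it repeatedly shuffles the group labels and counts how often chance alone produces a difference at least as extreme as the one observed.

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical daily conversions for pages A and B
    a = np.array([12, 15, 11, 14, 13, 16, 12, 15])
    b = np.array([16, 18, 15, 17, 19, 16, 18, 17])
    observed = b.mean() - a.mean()

    # Under H0 the labels are interchangeable: shuffle the combined
    # data and see how often chance produces a difference this extreme
    combined = np.concatenate([a, b])
    n_iter = 10_000
    count = 0
    for _ in range(n_iter):
        rng.shuffle(combined)
        diff = combined[len(a):].mean() - combined[:len(a)].mean()
        if abs(diff) >= abs(observed):
            count += 1

    print(f"estimated p-value: {count / n_iter:.4f}")

A small estimated p-value here means the observed difference is rare under the chance model, which is exactly what “statistically significant” claims.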

Type I and Type II Errors

Two types of errors are possible in significance testing: Type I and Type II. A Type I error, known as alpha (α), is when we mistakenly conclude that an effect is real when it is actually due to chance. A Type II error, known as beta (β), is when we mistakenly conclude that an effect is due to chance when it is actually real.2

Significance tests protect against being fooled by random chance, and thus minimize Type I errors by defining an acceptable alpha (usually 5% or 1%).

The most common way to reduce a Type II error is to increase the sample size, since a larger sample is required to detect a smaller effect. There is some danger here, though: with a large enough sample size, almost any effect can appear statistically significant even if it is not practically useful. Which type of error is more important (and should be minimized) depends on the management decision at hand. Figure 11.1 summarizes these types of errors.

Figure 11.1: Hypothesis Testing Types of Errors
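As an illustration of the sample-size effect, the Python sketch below (with assumed values: a true effect of 0.3 standard deviations and alpha = .05; scipy is assumed available) estimates the power, 1 − β, at several sample sizes by simulation:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def estimated_power(n, effect=0.3, alpha=0.05, trials=2000):
        # Fraction of experiments that detect a real effect of the
        # given size (in standard deviations) with n subjects per group
        hits = 0
        for _ in range(trials):
            a = rng.normal(0.0, 1.0, n)
            b = rng.normal(effect, 1.0, n)  # a real effect is present
            if stats.ttest_ind(a, b).pvalue < alpha:
                hits += 1
        return hits / trials

    for n in (20, 50, 200):
        print(n, estimated_power(n))  # power rises (beta falls) with n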

Difference of Means

The most common significance test for the A/B experimental design is to compare the means of two populations to see if they are different from each other. In this case, the null and alternative hypotheses are H0: µ1 = µ2 and HA: µ1 ≠ µ2. To do this, we compare the sample means X̄1 and X̄2 of random samples from the two populations, along with their distributions, to see if the distributions substantially overlap. If they do, the null hypothesis is supported; if not, the alternative hypothesis is supported. There are three popular methods to test the difference in means: Student’s t-test, Welch’s t-test, and the Wilcoxon rank-sum test.

Student’s t-Test

The purpose of the Student’s t-test is to determine whether the means of two separate populations are the same (H0) or different (HA), based on two samples of data. We saw earlier that the t-distribution is much like the normal distribution, but adjusted for smaller sample sizes. So to run the t-test, we make three assumptions:

  1. That the two populations are normally distributed

  2. That the variances (and standard deviations) are essentially the same (within a ratio of 3:1)

  3. That the observations in the samples are independent

The t-statistic, which we will label T, is a measure, in terms of standard deviations, of how far apart the two mean values are. For comparison, in a normal distribution, a value that lies more than 1.96 standard deviations from the mean falls outside the central 95% of the distribution. Because the t-distribution accounts for the sample size and has heavier tails for small samples, its critical t value is somewhat greater than 1.96. Hence, to reject the null hypothesis that two means are equivalent, we want the absolute value of T to be greater than the critical t value.

Because the t-distribution relates values of t to probabilities, we can also work directly with probability values. Usually, we test at the 95% confidence level or, stated in terms of the significance level, at the 5% level. Another, more stringent test is done at the 1% level. This is the alpha level mentioned above, so we test either for alpha = .05 or alpha = .01. Either method, comparing T to the critical t value or comparing the p-value to alpha, gives the same result of rejecting or failing to reject the null hypothesis.
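Here is a brief Python sketch (scipy assumed) showing that the two routes agree; the degrees of freedom match the two-samples-of-40 example later in this section, and the T value is illustrative:

    from scipy import stats

    alpha = 0.05
    df = 78                                  # e.g., two samples of 40: 40 + 40 - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-tailed critical value
    print(t_crit)                            # about 1.99, larger than 1.96

    T = 2.5                                  # an illustrative t statistic
    p = 2 * stats.t.sf(abs(T), df)           # two-tailed p-value
    print(abs(T) > t_crit, p < alpha)        # the two comparisons agree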

Even though we do not need to manually calculate a T value, we present the equations to show how T is calculated.

First we calculate a “pooled standard deviation” using the individual variances from each sample:

sp = sqrt(((n1-1)*s1² + (n2-1)*s2²) / (n1+n2-2))

Then we calculate T, which is the t statistic for our samples with n1+n2-2 degrees of freedom (DF), where m1 and m2 are the sample means:

T = (m1-m2) / (sp * sqrt(1/n1 + 1/n2))

A large absolute value of T is unlikely under the null hypothesis, so we would reject the null hypothesis. T will be either positive or negative depending on the order of subtraction of the two means. Alternatively, if the p-value that corresponds to T is less than the chosen alpha, we can also reject the null hypothesis.
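The formulas translate directly into code. Below is a minimal Python sketch (the sample values are made up) that computes sp and T as defined above and checks the result against scipy’s equal-variance t-test:

    import numpy as np
    from scipy import stats

    def pooled_t(x1, x2):
        # Student's t statistic using the pooled standard deviation
        n1, n2 = len(x1), len(x2)
        s1, s2 = np.var(x1, ddof=1), np.var(x2, ddof=1)  # sample variances
        sp = np.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
        T = (np.mean(x1) - np.mean(x2)) / (sp * np.sqrt(1 / n1 + 1 / n2))
        return T, n1 + n2 - 2                # T and degrees of freedom

    x1 = np.array([5.1, 4.8, 5.5, 5.0, 4.9, 5.3])
    x2 = np.array([5.6, 5.9, 5.4, 6.0, 5.7, 5.8])
    T, df = pooled_t(x1, x2)
    print(T, stats.ttest_ind(x1, x2, equal_var=True).statistic)  # should match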

Using Excel for Student’s t-Test

Excel has good support for running the Student’s t-test with the Analysis ToolPak. To use the ToolPak properly, there are a few things you must prepare. As an example, let’s assume we are testing out two different web pages where customers can order products. We take sample income amounts for 40 days for each of the web pages. Figure 11.2 shows the incomes for the two pages along with their means, variances, and standard deviations.

Figure 11.2: Web Page Income Data for t-Test

To use the ToolPak, do the following steps:

  1. Determine if the variances of the two populations are equal. Use the VAR.S(range) function for each range and compare the two variances. There are different schools of thought on how close they need to be in order to be considered equal. One rule of thumb by some statisticians is if the ratio of the larger to the smaller is less than 3:1, then the variances can be treated as equal.

    For the web page example, the ratio of the two variances is 1.069, so they are well within the ratio of 3:1.

  2. Open Data > Data Analysis (the Analysis ToolPak) and scroll down to t-Test: Two-Sample Assuming Equal Variances. Figure 11.3 shows the dialog window.

    Figure 11.3: ToolPak Dialog Window
  3. Enter the two sample ranges (include the header labels in the range if desired and check the labels box). The Hypothesized Mean Difference should be 0. Enter Alpha (usually 0.05 or 0.01). Enter the beginning cell of the output range, leaving room for 14 rows and 3 columns. Click OK. Figure 11.4 shows this dialog window.

    Figure 11.4: Dialog Window for t-Test
  4. Use either the t-statistic or the p-value to interpret the results. Figure 11.5 shows the results of the analysis.

Figure 11.5: Results of t-Test

The last five rows in the table present the results information. In our case, we will use the two-tailed test since we want to test the difference in both directions, larger or smaller.

Using the t-statistic: We can use the t-statistic to see if the results are statistically significant at the alpha level. Find the t Stat and the t Critical two-tail value. If the absolute value of the t Stat is larger than the t Critical (of whichever tail test is best for your problem), then we can reject the null hypothesis and say that the population means are different. If not, we fail to reject the null hypothesis because we do not have enough evidence that the population means are different.

The t Stat row in the results shows the calculated T value of -4.924. The t Critical two-tail value is 1.9908. Since the absolute value of T is much larger than the t Critical value, we reject the null hypothesis that the two means are equal.

Using the p-value: Similarly, we can use the p-values to see if the results are statistically significant at the alpha level. Find the P(T<=t) two-tail value (or the P(T<=t) one-tail value if you are testing a directional hypothesis). If the p-value is smaller than the alpha, then we can reject the null hypothesis and say that the population means are different.

Be careful to note that often the p-value is given in scientific notation because it is so small. A p-value of 4.63535E-06 means that we move the decimal six positions to the left, so the number is really 0.00000463535, which is much smaller than either 0.05 or 0.01. Therefore, we reject the null hypothesis that the means are equal.

However, if the p-value were larger than alpha, then we would fail to reject the null hypothesis because we do not have enough evidence that the population means are different.
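For readers who want the same workflow outside Excel, here is a Python sketch of the whole procedure. The income figures are randomly generated stand-ins, not the actual values from Figure 11.2:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    page_a = rng.normal(1000, 120, 40)   # stand-ins for 40 daily incomes
    page_b = rng.normal(1100, 124, 40)

    # Step 1: variance ratio check (Excel's VAR.S is np.var with ddof=1)
    v1, v2 = np.var(page_a, ddof=1), np.var(page_b, ddof=1)
    print("variance ratio:", max(v1, v2) / min(v1, v2))  # want under 3:1

    # Steps 2-4: equal-variance t-test and the two-tailed decision
    t_stat, p_two_tail = stats.ttest_ind(page_a, page_b, equal_var=True)
    alpha = 0.05
    decision = "reject H0" if p_two_tail < alpha else "fail to reject H0"
    print(f"t = {t_stat:.3f}, p = {p_two_tail:.4g}: {decision}")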

Paired Two-Sample t-Test

Let’s say the two samples contain two observations for each object of study; in other words, each data point in sample one can be paired with a data point in sample two. For example, this might occur with Test 1 and Test 2 scores for each student, to see if students did statistically significantly better on Test 2 after participating in an improved study program. We can compare the paired observations using a paired samples t-test.

In the Analysis ToolPak, choose t-Test: Paired Two-Sample for Means. The procedure to run this test and the interpretation is the same as for the Student’s t-test above.
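A minimal Python sketch of the paired test (the scores for eight students are hypothetical; scipy assumed):

    import numpy as np
    from scipy import stats

    # Hypothetical Test 1 and Test 2 scores for the same eight students
    test1 = np.array([72, 65, 80, 58, 77, 69, 74, 61])
    test2 = np.array([78, 70, 83, 64, 79, 75, 78, 66])

    # The paired test works on the per-student differences,
    # not on the two samples as independent groups
    t_stat, p_value = stats.ttest_rel(test2, test1)
    print(t_stat, p_value)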

Welch’s t-Test

If two populations are normally distributed but do not have the same variance, instead of using the Student’s t-test we can use Welch’s t-test. The calculation of Welch’s t-test uses the sample variance for each population instead of the pooled variance, and the calculation of degrees of freedom is more complex. It is a more robust method than the Student’s t-test, so some statisticians recommend using Welch’s method even when the samples have equal variances.3

In the Analysis ToolPak, choose t-Test: Two-Sample Assuming Unequal Variances. The procedure to run this test and the interpretation is the same as for the Student’s t-test above. You can see the Unequal Variances option in Figure 11.3 showing the ToolPak.

Let’s look at an example using the web page data, but with WebPageB having a very large variance. Figure 11.6 shows the results of running the test. Notice that the ratio of the two variances is 6.49, which is much larger than the cutoff ratio of 3:1. Each dataset again has 40 entries, but the degrees of freedom, which comes from the more complex calculation, is only 51.

Figure 11.6: Welch’s t-Test Results

The absolute value of T is 1.48, which is smaller than the t Critical value for both the one-tailed test (t = 1.675) and the two-tailed test (t = 2.007). The p-value for the one-tailed test is .0722, which is also larger than either alpha of .05 or .01. Thus, for this example we fail to reject the null hypothesis that the means are equal.
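In Python, Welch’s test is the same call with equal_var=False. The sketch below mirrors the unequal-variance example with randomly generated stand-in data (not the actual Figure 11.6 values):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    page_a = rng.normal(1000, 120, 40)
    page_b = rng.normal(1050, 310, 40)   # much larger variance, as in the example

    # equal_var=False selects Welch's t-test (separate sample variances
    # and Welch-Satterthwaite degrees of freedom)
    t_stat, p_value = stats.ttest_ind(page_a, page_b, equal_var=False)
    print(t_stat, p_value)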

Wilcoxon Rank-Sum Test

An underlying assumption for appropriate use of the previous t-tests was that the continuous outcome was approximately normally distributed or that the samples were sufficiently large (usually n1 > 30 and n2 > 30) to justify their use based on the Central Limit Theorem. When comparing two independent samples when the outcome is not normally distributed and the samples are small, a nonparametric test is appropriate.4

A popular nonparametric test to compare outcomes between two independent groups is the Mann-Whitney U test. The Mann-Whitney U test, sometimes called the Wilcoxon rank-sum test, is used to test whether two samples are likely to derive from the same population (i.e., that the two populations have the same shape). Some investigators interpret this test as comparing the medians between the two populations. Recall that the parametric t-tests compare the means (H0: μ1 = μ2) between independent groups.

In contrast, the null and two-sided research hypotheses for the nonparametric test are stated as follows:

  • H0: The two populations are equal

  • H1: The two populations are not equal

The Wilcoxon method ranks all of the observations, sums the ranks within each group (the rank-sums), and then determines the probability of observing rank-sums of that magnitude if the populations were identical. Wilcoxon’s method makes no assumptions about the shape of the population distributions and is thus more robust than the t-test.

The procedure to run a Wilcoxon rank-sum test in Excel is somewhat time-consuming; a walkthrough is found on the Statology website. In practice, the test is usually run in statistical analysis programs such as R and SAS.
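In Python, for example, the test is a single library call. The samples below are small, skewed, hypothetical response times where a t-test would be questionable:

    import numpy as np
    from scipy import stats

    # Small, non-normal samples (hypothetical response times in seconds)
    group1 = np.array([1.1, 0.9, 5.2, 1.4, 0.8, 1.0, 7.5])
    group2 = np.array([2.3, 3.1, 2.8, 9.9, 2.5, 3.4, 2.9])

    # Mann-Whitney U / Wilcoxon rank-sum: compares ranks, not means
    u_stat, p_value = stats.mannwhitneyu(group1, group2, alternative="two-sided")
    print(u_stat, p_value)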